Executive Summary

This project explores Airbnb listings throughout New York City using ggplot2 in R. We also use a Chi-Square test and ANOVA to examine the relationships between variables. The project uses RStudio to draw insights from the dataset. The data is sourced from Kaggle and covers Airbnb listings in New York City. The variables include neighborhood, room type, location (longitude and latitude), reviews, price, and availability.

This tutorial will take the reader through a step-by-step process to explore data in R using ggplot2.

At the end of the tutorial, readers will be able to create plots in R using ggplot2, fit an ANOVA model, and conduct a Chi-Square test. The objective of this project is to uncover insights into the NYC Airbnb market and identify trends in the data.

Example of a scatterplot in ggplot2

Introduction

Welcome to the ggplot2 tutorial in R. This lesson demonstrates how to use the ggplot2 library in R to create data visualizations and run analyses. We will use the Airbnb data set, which contains Airbnb listings in New York City. We will unveil trends in the data and learn more about this exciting data set. Come along as we explore all the Big Apple has to offer.

Objectives

There are a few main objectives of this tutorial including the following:

Key Takeaways

Data Source and Variables

Variables before any data cleaning

Basics of ggplot

What is ggplot?

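ggplot2 is an R package for creating graphics. A plot is built in layers: you supply the data, map variables to aesthetics with aes(), and add geoms that draw the data.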

Basics

Geoms
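
As a minimal sketch of how these pieces fit together, the example below builds one plot from the three basic parts: a data frame, an aesthetic mapping (`aes()`), and a geom layer. It uses R's built-in `mtcars` data set rather than the Airbnb data, purely so that it runs on its own. The object name `p` is only for illustration.

```r
library(ggplot2)

# data: the built-in mtcars data frame (a stand-in, not the Airbnb data)
# aes(): maps columns to visual properties (here x and y position)
# geom_point(): draws one point per row
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Further layers are added with `+`; here a fitted line is drawn over the points
p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Every plot later in this tutorial follows the same pattern, with a different geom or different aesthetics.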

Clean the Data

# load the tidyverse, which includes ggplot2 and dplyr; both are used throughout the tutorial
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

# import data and save as airbnb_data
airbnb_data <- read.csv("airbnb.csv")

# look at the data; glimpse() shows the number of rows and columns and a preview of each variable
glimpse(airbnb_data)
## Rows: 48,895
## Columns: 16
## $ id                             <int> 2539, 2595, 3647, 3831, 5022, 5099, 512…
## $ name                           <chr> "Clean & quiet apt home by the park", "…
## $ host_id                        <int> 2787, 2845, 4632, 4869, 7192, 7322, 735…
## $ host_name                      <chr> "John", "Jennifer", "Elisabeth", "LisaR…
## $ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Manhattan", "…
## $ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "Cli…
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190, -73.95…
## $ room_type                      <chr> "Private room", "Entire home/apt", "Pri…
## $ price                          <int> 149, 225, 150, 89, 80, 200, 60, 79, 79,…
## $ minimum_nights                 <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4…
## $ number_of_reviews              <int> 9, 45, 0, 270, 9, 74, 49, 430, 118, 160…
## $ last_review                    <chr> "2018-10-19", "2019-05-21", "", "2019-0…
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
## $ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, …
## $ availability_365               <int> 365, 355, 365, 194, 0, 129, 0, 220, 0, …
# change the names of the variables
colnames(airbnb_data) <- c(
  "id", "Airbnb_name", "host_id", "host_name", "borough", "neighborhood",
  "latitude", "longitude", "property_type", "price_per_night",
  "minimum_nights", "number_of_reviews", "last_review_date",
  "reviews_per_month", "total_host_listings", "availability_365"
)

# Use the dplyr library to clean the data

# The following steps remove rows that are missing data

# drop rows with no reviews_per_month value
airbnb_data <- airbnb_data %>%
  filter(!is.na(reviews_per_month))

# read.csv() reads a missing last_review as "" rather than NA,
# so convert empty strings to NA before filtering
airbnb_data <- airbnb_data %>%
  mutate(last_review_date = na_if(last_review_date, "")) %>%
  filter(!is.na(last_review_date))

#check the data again
glimpse(airbnb_data)
## Rows: 38,843
## Columns: 16
## $ id                  <int> 2539, 2595, 3831, 5022, 5099, 5121, 5178, 5203, 52…
## $ Airbnb_name         <chr> "Clean & quiet apt home by the park", "Skylit Midt…
## $ host_id             <int> 2787, 2845, 4869, 7192, 7322, 7356, 8967, 7490, 75…
## $ host_name           <chr> "John", "Jennifer", "LisaRoxanne", "Laura", "Chris…
## $ borough             <chr> "Brooklyn", "Manhattan", "Brooklyn", "Manhattan", …
## $ neighborhood        <chr> "Kensington", "Midtown", "Clinton Hill", "East Har…
## $ latitude            <dbl> 40.64749, 40.75362, 40.68514, 40.79851, 40.74767, …
## $ longitude           <dbl> -73.97237, -73.98377, -73.95976, -73.94399, -73.97…
## $ property_type       <chr> "Private room", "Entire home/apt", "Entire home/ap…
## $ price_per_night     <int> 149, 225, 89, 80, 200, 60, 79, 79, 150, 135, 85, 8…
## $ minimum_nights      <int> 1, 1, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4, 2, 90, 2, 2,…
## $ number_of_reviews   <int> 9, 45, 270, 9, 74, 49, 430, 118, 160, 53, 188, 167…
## $ last_review_date    <chr> "2018-10-19", "2019-05-21", "2019-07-05", "2018-11…
## $ reviews_per_month   <dbl> 0.21, 0.38, 4.64, 0.10, 0.59, 0.40, 3.47, 0.99, 1.…
## $ total_host_listings <int> 6, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, 1, 1, 1, 1, 1,…
## $ availability_365    <int> 365, 355, 194, 0, 129, 0, 220, 0, 188, 6, 39, 314,…
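
One detail worth checking in the cleaning step: `read.csv()` reads a blank text field as an empty string (`""`), not as `NA`, so `is.na()` alone does not flag it. The sketch below illustrates this on a small made-up data frame (`listings` is hypothetical and only mimics two columns of `airbnb_data`), then converts empty strings to `NA` with dplyr's `na_if()` before filtering.

```r
library(dplyr)

# Hypothetical mini data set with the same two columns as airbnb_data
listings <- data.frame(
  last_review_date = c("2018-10-19", "", "2019-07-05"),
  reviews_per_month = c(0.21, NA, 4.64)
)

# Count NA values per column: the empty string is not counted as NA
colSums(is.na(listings))

# Convert empty strings to NA, then drop rows missing either value
cleaned <- listings %>%
  mutate(last_review_date = na_if(last_review_date, "")) %>%
  filter(!is.na(last_review_date), !is.na(reviews_per_month))

nrow(cleaned)
```

In the Airbnb data, the listings with no review date are also the ones with no `reviews_per_month`, which is why the row count drops only once.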

Plots and Graphs with Customization

Types of Plots

Others: geom_smooth() or stat_smooth() = smoothed conditional means

Customization:

Here is how you can customize a plot

Themes

Colors

Titles
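
As a rough sketch of the three kinds of customization named above, the example below applies a title and labels, a manual color scale, and a theme to one scatter plot. The data frame `listings` is made up for illustration and only borrows column names from `airbnb_data`; the specific colors are arbitrary choices.

```r
library(ggplot2)

# Hypothetical mini data set standing in for airbnb_data
listings <- data.frame(
  borough = c("Brooklyn", "Manhattan", "Queens", "Manhattan", "Brooklyn"),
  price_per_night = c(149, 225, 80, 200, 89),
  number_of_reviews = c(9, 45, 74, 49, 270)
)

p <- ggplot(listings, aes(x = number_of_reviews, y = price_per_night, color = borough)) +
  geom_point(size = 3) +
  # Titles: labs() sets the plot title, axis labels, and legend title
  labs(title = "Price by Number of Reviews", x = "Reviews",
       y = "Price per Night", color = "Borough") +
  # Colors: replace the default palette with named manual colors
  scale_color_manual(values = c(Brooklyn = "darkorange",
                                Manhattan = "purple",
                                Queens = "steelblue")) +
  # Themes: theme_minimal() changes the overall non-data appearance
  theme_minimal()

p
```
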

# load the ggplot2 library (already attached by tidyverse, shown here for clarity)
library(ggplot2)

# start each plot with ggplot()

# Scatter plots

# 1. Create a scatter plot with longitude as x and latitude as y. Color the points by property type and add a title and axis labels

ggplot(data = airbnb_data) +
  geom_point(mapping = aes(x = longitude, y = latitude, color = property_type)) +
  labs(title = "Airbnb Locations by Property Type", x = "Longitude", y = "Latitude")

# 2. Create a scatter plot with minimum nights as x and the number of days available per year as y. Color the points by property type and add a title and axis labels

ggplot(data = airbnb_data) +
  geom_point(mapping = aes(x = minimum_nights, y = availability_365, color = property_type)) +
  labs(title = "Minimum Nights by Availability", x = "Min Nights", y = "Availability")

# 3. Create a scatter plot of location using longitude and latitude. Make the points purple and add a title and axis labels, this time using ggtitle() and xlab()/ylab(). Add geom_smooth() with a linear model to look for a trend

ggplot(data = airbnb_data, mapping = aes(x = longitude, y = latitude)) +
  geom_point(color = "purple") +
  xlab("Longitude") +
  ylab("Latitude") +
  ggtitle("Scatter Plot of Location") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

# Box plots

# 1. Create a box plot of prices by borough and fill by borough

ggplot(data = airbnb_data) +
  geom_boxplot(mapping = aes(x = borough, y = price_per_night, fill = borough)) +
  labs(title = "Prices by Borough", x = "Borough", y = "Price")

# 2. Create a box plot of reviews per month by borough and fill by borough

ggplot(data = airbnb_data) +
  geom_boxplot(mapping = aes(x = borough, y = reviews_per_month, fill = borough)) +
  labs(title = "Reviews by Borough", x = "Borough", y = "Reviews per Month")

# Bar plots

# 1. Create a bar plot of the boroughs

ggplot(data = airbnb_data) +
  geom_bar(mapping = aes(x = borough)) +
  labs(title = "Borough Bar Plot", x = "Borough")

# 2. Create a bar plot of the property type

ggplot(data = airbnb_data) +
  geom_bar(mapping = aes(x = property_type)) +
  labs(title = "Property Type", x = "Property Type")

#Histograms

# 1. Create a histogram of the price distribution. Make the bars red. Set the bin width to 40.

ggplot(airbnb_data, aes(x = price_per_night)) +
  geom_histogram(binwidth = 40, fill = "red") +
  labs(title = "Distribution of Prices", x = "Price", y = "Frequency")

# To save a plot to a file, use ggsave()
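
A minimal sketch of saving a plot with `ggsave()` follows. It uses the built-in `mtcars` data so that it runs on its own. The file name `example_plot.png` and the use of `tempdir()` are placeholder choices, not part of the original tutorial.

```r
library(ggplot2)

# Built-in mtcars data, used only to have something to save
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output path; tempdir() keeps the example out of the project folder
out_file <- file.path(tempdir(), "example_plot.png")

# width and height are in inches by default; dpi controls the resolution
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

If `plot` is omitted, `ggsave()` saves the last plot that was displayed. The file type is inferred from the file extension.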

Faceting

Faceting splits one plot into multiple panels, each showing a different subset of the data.

facet_grid() = forms a matrix of panels using row and column variables. This is used mainly when there are two discrete variables.

facet_wrap() = lays out a sequence of panels for a single variable with multiple levels. Use this function when there is only one faceting variable.

Facet Plot Example
# Let's use facet_wrap() to draw a separate boxplot of the number of reviews for each borough

ggplot(data = airbnb_data)  +  
  geom_boxplot(mapping=aes(y=number_of_reviews)) +
  facet_wrap(~borough)

# Facet price per night by property type, using fill to set the color of the boxes

ggplot(data = airbnb_data) +
  geom_boxplot(mapping= aes(x=price_per_night), fill = "orange") +
  facet_wrap(~property_type)

# Facet price per night by borough, using fill to set the color of the bars

ggplot(data = airbnb_data) +
  geom_histogram(mapping= aes(x=price_per_night), fill = "purple") +
  facet_wrap(~borough)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
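
The examples in this section all use `facet_wrap()`. For comparison, here is a small sketch of `facet_grid()` with two discrete variables. The data frame `listings` is made up and only borrows column names from `airbnb_data`; the bin width of 50 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical mini data set standing in for airbnb_data
listings <- data.frame(
  borough = rep(c("Brooklyn", "Manhattan"), each = 4),
  property_type = rep(c("Private room", "Entire home/apt"), times = 4),
  price_per_night = c(60, 150, 75, 180, 90, 225, 110, 300)
)

# Rows are split by property_type and columns by borough,
# giving one panel per combination of the two variables
p <- ggplot(listings, aes(x = price_per_night)) +
  geom_histogram(binwidth = 50, fill = "purple") +
  facet_grid(property_type ~ borough)

p
```

Setting `binwidth` explicitly also avoids the `stat_bin()` message about the default of 30 bins.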

Descriptive Statistics

#Let's do some descriptive statistics for some of our variables 

summary(airbnb_data$price_per_night)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    69.0   101.0   142.3   170.0 10000.0
summary(airbnb_data$minimum_nights)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    2.000    5.868    4.000 1250.000
summary(airbnb_data$number_of_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     9.0    29.3    33.0   629.0
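
In the price summary, the mean (142.3) sits well above the median (101.0) and the maximum is 10,000, which suggests a right-skewed distribution with a few very expensive listings. One way to look at this by group is a dplyr summary, sketched below on a small made-up data frame (`listings` and its values are hypothetical).

```r
library(dplyr)

# Hypothetical mini data set; one extreme price mimics the outliers in the real data
listings <- data.frame(
  borough = c("Brooklyn", "Brooklyn", "Manhattan", "Manhattan", "Manhattan"),
  price_per_night = c(80, 120, 150, 200, 10000)
)

# Count, median, and mean price per borough
by_borough <- listings %>%
  group_by(borough) %>%
  summarise(
    n = n(),
    median_price = median(price_per_night),
    mean_price = mean(price_per_night)
  )

by_borough
```

The median is much less sensitive to a single extreme listing than the mean, so comparing the two columns gives a quick sense of how strongly outliers affect each group.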

Inferential Statistics

#Let's check whether there is a relationship between property type and borough.

#Chi-Square Test (alpha = 0.05): We will use a Chi-Square test of independence to check whether property type and borough are associated.

cross_tab <- table(airbnb_data$property_type, airbnb_data$borough)

result <- chisq.test(cross_tab)
print(result)
## 
##  Pearson's Chi-squared test
## 
## data:  cross_tab
## X-squared = 962.33, df = 8, p-value < 2.2e-16
# Since the p-value is less than alpha (0.05), there is a statistically significant association between property type and borough.
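
A significant Chi-Square result says that the two variables are associated, but not which combinations drive the association. As a hedged sketch, the object returned by `chisq.test()` also holds the expected counts and standardized residuals, which can be inspected directly. The counts below are made up for illustration and are not taken from the Airbnb data.

```r
# Hypothetical counts standing in for table(property_type, borough)
cross_tab <- matrix(
  c(120, 80, 20,    # Brooklyn column
    200, 150, 10),  # Manhattan column
  nrow = 3,
  dimnames = list(
    property_type = c("Entire home/apt", "Private room", "Shared room"),
    borough = c("Brooklyn", "Manhattan")
  )
)

result <- chisq.test(cross_tab)

# Expected counts under independence; a common rule of thumb is that all are at least 5
result$expected

# Standardized residuals: large absolute values mark the cells that
# deviate most from what independence would predict
result$stdres
```
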


#ANOVA: We will now use ANOVA to compare the means (prices) between the different boroughs. 

# Create an ANOVA model
model <- aov(price_per_night ~ borough, data = airbnb_data)

# Print a summary of the results
summary(model)
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## borough         4 4.507e+07 11267628   299.4 <2e-16 ***
## Residuals   38838 1.462e+09    37631                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Based on the results of the ANOVA model, the p-value is less than 0.05, so there is evidence that the mean price of at least one borough differs significantly from the others.
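
ANOVA indicates only that at least one group mean differs, not which ones. A common follow-up, which is not part of the original analysis, is Tukey's HSD test, available in base R as `TukeyHSD()`. The sketch below uses made-up prices for three boroughs and stores `borough` as a factor to be safe.

```r
# Hypothetical prices standing in for airbnb_data
listings <- data.frame(
  borough = factor(rep(c("Bronx", "Brooklyn", "Manhattan"), each = 5)),
  price_per_night = c(60, 70, 65, 80, 75,
                      100, 110, 95, 120, 105,
                      180, 200, 190, 220, 210)
)

model <- aov(price_per_night ~ borough, data = listings)

# Tukey's HSD compares every pair of boroughs and adjusts the
# p-values for the number of comparisons made
posthoc <- TukeyHSD(model)
posthoc
```

Each row of the output is one pair of boroughs, with the difference in means, a confidence interval, and an adjusted p-value.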

Data Preparation

There were a few challenges when tidying the data:

Missing Data:

Outliers:

Variable Names:

A few examples:

Conclusion

To conclude, we learned the basics of ggplot2 in R using the NYC Airbnb data set from Kaggle. This library allowed us to dig deeper into properties in NYC and uncover insights about Airbnb listings, such as pricing, locations, and distribution across neighborhoods. We are now able to use ggplot2 to visualize “the city that never sleeps.”

Follow-Up Resources

If you would like to explore ggplot2 further, or other visualization tools in R, check out the recommended resources below:

Appendix A: References